Goto

Collaborating Authors

 diagonal linear network


Inductive biases of multi-task learning and finetuning: multiple regimes of feature reuse

Neural Information Processing Systems

Neural networks are often trained on multiple tasks, either simultaneously (multi-task learning, MTL) or sequentially (pretraining and subsequent finetuning, PT+FT). In particular, it is common practice to pretrain neural networks on a large auxiliary task before finetuning on a downstream task with fewer samples. Despite the prevalence of this approach, the inductive biases that arise from learning multiple tasks are poorly characterized. In this work, we address this gap.




S)GD over Diagonal Linear Networks Implicit Bias Large and Edge of Stability

Neural Information Processing Systems

Currently, most theoretical works on implicit regularisation have primarily focused on continuous time approximations of (S)GD where the impact of crucial hyperparameters such as the stepsize and the minibatch size are ignored. One such common simplification is to analyse gradient flow, which is a continuous time limit of GD and minibatch SGD with an infinitesimal stepsize. By definition, this analysis does not capture the effect of stepsize or stochasticity.




(S)GD over Diagonal Linear Networks: Implicit bias, Large Stepsizes and Edge of Stability

Neural Information Processing Systems

In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over $2$-layer diagonal linear networks. We prove the convergence of GD and SGD with macroscopic stepsizes in an overparametrised regression setting and characterise their solutions through an implicit regularisation problem. Our crisp characterisation leads to qualitative insights about the impact of stochasticity and stepsizes on the recovered solution. Specifically, we show that large stepsizes consistently benefit SGD for sparse regression problems, while they can hinder the recovery of sparse solutions for GD. These effects are magnified for stepsizes in a tight window just below the divergence threshold, in the ``edge of stability'' regime. Our findings are supported by experimental results.


Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity

Neural Information Processing Systems

Understanding the implicit bias of training algorithms is of crucial importance in order to explain the success of overparametrised neural networks. In this paper, we study the dynamics of stochastic gradient descent over diagonal linear networks through its continuous time version, namely stochastic gradient flow. We explicitly characterise the solution chosen by the stochastic flow and prove that it always enjoys better generalisation properties than that of gradient flow.Quite surprisingly, we show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias. To fully complete our analysis, we provide convergence guarantees for the dynamics. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation and they help explain the greater performances of stochastic gradient descent over gradient descent observed in practice.


Saddle-to-Saddle Dynamics in Diagonal Linear Networks

Neural Information Processing Systems

In this paper we fully describe the trajectory of gradient flow over $2$-layer diagonal linear networks for the regression setting in the limit of vanishing initialisation. We show that the limiting flow successively jumps from a saddle of the training loss to another until reaching the minimum $\ell_1$-norm solution. We explicitly characterise the visited saddles as well as the jump times through a recursive algorithm reminiscent of the LARS algorithm used for computing the Lasso path. Starting from the zero vector, coordinates are successively activated until the minimum $\ell_1$-norm solution is recovered, revealing an incremental learning. Our proof leverages a convenient arc-length time-reparametrisation which enables to keep track of the transitions between the jumps. Our analysis requires negligible assumptions on the data, applies to both under and overparametrised settings and covers complex cases where there is no monotonicity of the number of active coordinates. We provide numerical experiments to support our findings.


Heavy-Ball Momentum Method in Continuous Time and Discretization Error Analysis

Lyu, Bochen, Zhang, Xiaojing, Zheng, Fangyi, Wang, He, Wang, Zheng, Zhu, Zhanxing

arXiv.org Artificial Intelligence

This paper establishes a continuous time approximation, a piece-wise continuous differential equation, for the discrete Heavy-Ball (HB) momentum method with explicit discretization error. Investigating continuous differential equations has been a promising approach for studying the discrete optimization methods. Despite the crucial role of momentum in gradient-based optimization methods, the gap between the original discrete dynamics and the continuous time approximations due to the discretization error has not been comprehensively bridged yet. In this work, we study the HB momentum method in continuous time while putting more focus on the discretization error to provide additional theoretical tools to this area. In particular, we design a first-order piece-wise continuous differential equation, where we add a number of counter terms to account for the discretization error explicitly. As a result, we provide a continuous time model for the HB momentum method that allows the control of discretization error to arbitrary order of the step size. As an application, we leverage it to find a new implicit regularization of the directional smoothness and investigate the implicit bias of HB for diagonal linear networks, indicating how our results can be used in deep learning. Our theoretical findings are further supported by numerical experiments.